
 target encoder


0918183ced31affb7ce0345e45ac1943-Supplemental-Conference.pdf

Neural Information Processing Systems

We evaluate Okapi using three datasets (iWildCam, PovertyMap, and CivilComments) taken from the WILDS 2.0 benchmark [63]. These datasets were chosen specifically because [63] reported poor performance across the board for semi-supervised and domain adaptation methods relative to the ERM baselines. For PovertyMap in particular, ERM was found to vastly outperform any competing methods utilising the unlabelled data and/or domain labels. For iWildCam, the task is multiclass species classification of animals in camera trap images. The dataset contains 1022K images of animals, each annotated with the domain, s, that identifies the camera trap that captured it.




Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Neural Information Processing Systems

Figure 1: Our C-JEPA achieves faster and better convergence than I-JEPA. Unsupervised learning of visual representations has recently seen remarkable progress, primarily due to the development of innovative architectures and strategies that exploit unlabeled imagery.




Unsupervised Training of Vision Transformers with Synthetic Negatives

arXiv.org Artificial Intelligence

This paper does not introduce a novel method per se. Instead, we address the neglected potential of hard negative samples in self-supervised learning. Previous works explored synthetic hard negatives but rarely in the context of vision transformers. We build on this observation and integrate synthetic hard negatives to improve vision transformer representation learning. This simple yet effective technique notably improves the discriminative power of learned representations. Our experiments show performance improvements for both DeiT-S and Swin-T architectures.
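To make the idea of synthetic hard negatives concrete, here is a minimal sketch in plain Python. It assumes a feature-mixing scheme: the negatives most similar to the query are selected, and random convex combinations of them are re-normalized to form new negatives. The abstract does not specify the synthesis procedure, so the function `synthesize_hard_negatives` and its parameters `top_k`, `num_synthetic`, and `seed` are hypothetical illustrations, not the paper's method.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def synthesize_hard_negatives(query, negatives, top_k=2, num_synthetic=2, seed=0):
    """Mix the hardest negatives to create extra ones (illustrative assumption).

    "Hard" here means high cosine similarity to the query embedding.
    """
    if top_k < 2 or len(negatives) < top_k:
        raise ValueError("need at least two negatives to mix")
    rng = random.Random(seed)
    q = normalize(query)
    negs = [normalize(n) for n in negatives]
    # Keep the top_k negatives that are most similar to the query.
    hardest = sorted(negs, key=lambda n: dot(q, n), reverse=True)[:top_k]
    synthetic = []
    for _ in range(num_synthetic):
        a, b = rng.sample(hardest, 2)  # two distinct hard negatives
        alpha = rng.random()           # mixing coefficient in [0, 1)
        mixed = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
        synthetic.append(normalize(mixed))
    return synthetic


query = [1.0, 0.0]
negatives = [[0.9, 0.1], [0.8, -0.2], [-1.0, 0.0]]
extra = synthesize_hard_negatives(query, negatives)
```

In a contrastive loss, the vectors in `extra` would simply be appended to the list of real negatives for that query.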


Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning

arXiv.org Artificial Intelligence

Self-Supervised Learning (SSL) has revolutionized representation learning for speech and audio, enabling models to learn from unlabeled data and excel in diverse downstream tasks [1, 2, 3, 4]. Early SSL approaches for audio, such as contrastive predictive coding and wav2vec 2.0, learned latent speech representations by masking the input and solving a contrastive task over latent codes [5]. Follow-up methods like HuBERT [1] introduced offline clustering to generate pseudo-labels for masked audio segments, and WavLM [6] applied data augmentation and denoising to improve robustness in speech representation learning. More recently, latent prediction approaches have gained traction: data2vec [7] and its efficient successor data2vec 2.0 [8] employ a teacher-student framework to predict contextualized latent representations of the input, achieving strong results across vision, speech, and language tasks. In the audio domain, Niizumi et al. introduced Masked Modeling Duo (M2D) [4], which uses two networks (online and momentum encoder) to predict masked patch embeddings and attained state-of-the-art results on numerous audio benchmarks. In computer vision, a new paradigm called Joint-Embedding Predictive Architecture (JEPA) [9, 10, 11] has been proposed to predict hidden content in a high-level latent space instead of pixel space.
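The momentum (teacher) encoder mentioned above is commonly maintained as an exponential moving average of the online (student) encoder's weights. The sketch below shows that update on flat lists of parameters. The function name `ema_update` and the default `momentum` value are illustrative assumptions and are not taken from any of the cited papers.

```python
def ema_update(target_params, online_params, momentum=0.99):
    """Return updated target parameters: m * target + (1 - m) * online.

    The target encoder receives no gradients; it only tracks the online
    encoder through this moving average.
    """
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return [
        momentum * t + (1.0 - momentum) * o
        for t, o in zip(target_params, online_params)
    ]


# One update step with momentum 0.5 moves the target halfway to the online weights.
target = [0.0, 0.0]
online = [1.0, 2.0]
target = ema_update(target, online, momentum=0.5)
```

A momentum close to 1 makes the target change slowly, which is the usual motivation for using it as a stable prediction target.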


BESA: Boosting Encoder Stealing Attack with Perturbation Recovery

arXiv.org Artificial Intelligence

To boost the encoder stealing attack under the perturbation-based defense that hinders the attack performance, we propose a boosting encoder stealing attack with perturbation recovery named BESA. It aims to overcome perturbation-based defenses. The core of BESA consists of two modules: perturbation detection and perturbation recovery, which can be combined with canonical encoder stealing attacks. The perturbation detection module utilizes the feature vectors obtained from the target encoder to infer the defense mechanism employed by the service provider. Once the defense mechanism is detected, the perturbation recovery module leverages the well-designed generative model to restore a clean feature vector from the perturbed one. Through extensive evaluations based on various datasets, we demonstrate that BESA significantly enhances the surrogate encoder accuracy of existing encoder stealing attacks by up to 24.63% when facing state-of-the-art defenses and combinations of multiple defenses. Pre-trained encoders are extensively utilized across various domains in real-world scenarios [1]. However, training well-performing pre-trained encoders is a time-consuming, resource-intensive, and costly process [2]. Hence, encoder owners are highly motivated to safeguard the privacy of their pre-trained encoders. Unfortunately, recent works have shown that pre-trained encoders are susceptible to encoder stealing attacks [3]. These attacks allow an attacker to create a surrogate encoder that closely mimics the functionality of a targeted encoder by simply querying it through the APIs. The consequences of such attacks can be quite severe.


Building Bridges between Regression, Clustering, and Classification

arXiv.org Machine Learning

Regression, the task of predicting a continuous scalar target y based on some features x, is one of the most fundamental tasks in machine learning and statistics. It has been observed and theoretically analyzed that the classical approach, mean squared error minimization, can lead to suboptimal results when training neural networks. In this work, we propose a new method to improve the training of these models on regression tasks with continuous scalar targets. Our method is based on casting this task in a different fashion, using a target encoder and a prediction decoder, inspired by approaches in classification and clustering. We showcase the performance of our method on a wide range of real-world datasets.
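One way to picture a target encoder and prediction decoder for regression is soft binning: the scalar target is encoded as weights over fixed bin centers, a classifier is trained against those weights, and predictions are decoded back to a scalar as a weighted average. The sketch below is only an illustration under that assumption. The abstract does not describe the actual encoder or decoder, and the names `encode_target`, `decode_prediction`, and `centers` are hypothetical.

```python
def encode_target(y, centers):
    """Encode scalar y as weights over sorted bin centers (assumed scheme).

    The weight is split linearly between the two centers that bracket y;
    values outside the range are clamped to the nearest end center.
    """
    weights = [0.0] * len(centers)
    if y <= centers[0]:
        weights[0] = 1.0
        return weights
    if y >= centers[-1]:
        weights[-1] = 1.0
        return weights
    for k in range(len(centers) - 1):
        lo, hi = centers[k], centers[k + 1]
        if lo <= y <= hi:
            w_hi = (y - lo) / (hi - lo)
            weights[k] = 1.0 - w_hi
            weights[k + 1] = w_hi
            break
    return weights


def decode_prediction(probs, centers):
    """Decode class probabilities back to a scalar as their expected center."""
    return sum(p * c for p, c in zip(probs, centers))


centers = [0.0, 1.0, 2.0]
encoded = encode_target(0.25, centers)
decoded = decode_prediction(encoded, centers)
```

With this particular encoding, decoding the encoded target recovers the original value for any y inside the range of the centers, so the classification view loses no information about in-range targets.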